Fault Tolerance in Grid Computing: State of the Art and Open Issues
Abstract
Fault tolerance is an important property of large-scale computational grid systems, in which geographically distributed nodes cooperate to execute a task. To achieve a high level of reliability and availability, the grid infrastructure must be robustly fault tolerant. Because resource failures can be fatal to job execution, a fault tolerance service is essential for meeting QoS requirements in grid computing. The techniques most commonly used to provide fault tolerance are job checkpointing and replication. Both reduce the amount of work lost when system availability changes, but both can introduce significant runtime overhead. That overhead depends largely on the length of the checkpointing interval and on the chosen number of replicas, respectively. In complex scientific workflows, where tasks execute in a well-defined order, reliability is a further major challenge because of the unreliable nature of grid resources.
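The trade-off the abstract describes can be illustrated with a minimal sketch. It is not taken from the paper: it assumes Young's classical first-order approximation for the checkpointing interval and assumes that replicas fail independently. All function and parameter names (`young_interval`, `wasted_fraction`, `replica_success_probability`, `checkpoint_cost`, `mtbf`) are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    checkpoint_cost: time to write one checkpoint (assumed constant).
    mtbf: mean time between failures of the resource (assumed known).
    Both values must use the same time unit.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def wasted_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of runtime lost to fault tolerance.

    First term: time spent writing checkpoints.
    Second term: expected rework after a failure (half an interval on
    average), amortized over the mean time between failures.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


def replica_success_probability(p_success: float, replicas: int) -> float:
    """Probability that at least one replica finishes.

    Assumes every replica succeeds independently with probability p_success.
    """
    return 1.0 - (1.0 - p_success) ** replicas


if __name__ == "__main__":
    cost, mtbf = 50.0, 10_000.0  # illustrative values, in seconds
    best = young_interval(cost, mtbf)
    for interval in (best / 2, best, best * 2):
        print(interval, wasted_fraction(interval, cost, mtbf))
    for n in (1, 2, 3):
        print(n, replica_success_probability(0.5, n))
```

Under this simple model, an interval that is too short spends the time on checkpoint writes and one that is too long spends it on recomputation, while each extra replica raises the chance of completion at the price of one more resource running the same job.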
Similar resources
Fault-tolerant behavior in state-of-the-art Grid Workflow Management Systems
While the workflow paradigm, emerged from the field of business processes, has been proven to be the most successful paradigm for creating scientific applications for execution also on Grid infrastructures, most of the current Grid workflow management systems still cannot deliver the quality, robustness and reliability that are needed for widespread acceptance as tools used on a day-to-day basi...
Techniques for Designing Survivable Optical Grid Networks
Grid computing involves high performance computing with resource sharing to support data-intensive applications, and requires high speed communications. Wavelength division multiplexing (WDM) optical networks become a natural choice for interconnecting the distributed computational and/or storage resources due to their high throughput, high reliability and low cost. This has led to increased re...
Achieving QoS in Highly Unreliable Grid Environments
Grids can form the basis for pervasive computing due to their ability of being open, scalable, and flexible to various changes (from topology changes to unpredicted failures of nodes). However, such environments are prone to failures due to their nature and need a certain level of reliability in order to provide viable and commercially exploitable solutions. This is causing nowadays a significa...
Context Prediction based on Context Histories: Expected Benefits, Issues and Current State-of-the-Art
This paper presents the topic of context prediction as one possibility to exploit context histories. It lists some expected benefits of context prediction for certain application areas and discusses the associated issues in terms of accuracy, fault tolerance, unobtrusive operation, user acceptance, problem complexity and privacy. After identifying the challenges in context prediction, a first a...
A survey on virtual machine migration and server consolidation frameworks for cloud data centers
Modern Cloud Data Centers exploit virtualization for efficient resource management to reduce cloud computational cost and energy budget. Virtualization empowered by virtual machine (VM) migration meets the ever increasing demands of dynamic workload by relocating VMs within Cloud Data Centers. VM migration helps successfully achieve various resource management objectives such as load balancing,...